ST310 Course Project

The aim of this project is to demonstrate the application of machine learning techniques discussed in the ST310 Machine Learning module, drawing primarily from the expositions in (James et al. 2017) and (Hastie et al. 2009), among other texts. We examine the use of both linear techniques, e.g. Logistic Regression, and non-linear methods, e.g. Random Forests and Gradient Boosted Decision Trees, for a generic binary classification problem.

Remark: This report is best viewed as a .html file. If executing this notebook as a .Rmd file, ensure that all libraries and dependencies are installed and that the chunks are executed in sequential order.

Dataset

We obtain the dataset from the Kaggle March Tabular Playground competition. The data consists of anonymised features which correspond to a binary outcome variable; in other words, the task is a classification problem. Although the data is anonymised, the challenge description states:

The dataset used for this competition is synthetic but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.

We believe that this machine learning task is well-motivated, as predicting the amount of an insurance claim, or whether a claim occurred at all, would have a meaningful business impact in the context of actuarial pricing and risk management.

# generic functions and plotting
library(tidyverse)
library(tidyr)
library(ggplot2) 
library(GGally) 
library(patchwork)
library(broom)
library(data.table)
library(doParallel)
library(foreach)
# modelling
library(caret)
library(tidymodels)
library(glmnet)
library(randomForest)
library(xgboost)
library(catboost)
# metrics
library(car) #outliers
library(pROC)
library(pdp)
# read in data
df <- read.csv(file = "../data/train.csv")
cat(c("The dimensions of the array are :", dim(df)[1], ", ", dim(df)[2]))
The dimensions of the array are : 300000 ,  32

The dataset consists of 30 features, 19 categorical variables and 11 numerical variables, and a binary outcome variable target. We will need to process these categorical variables in some manner, which we shall consider in the subsequent section.

Preprocessing

We remove the first column id, which represents a unique identifier for each observation. We then convert the first 19 columns, which are categorical variables, into the factor data type in R. Given computational limitations (the code is executed on single-CPU laptops), we subsample n_train = 4000 observations from the complete dataset of 300,000 observations. For our subsequent analysis, we use only this subsample of 4000 observations, although the approaches extend naturally to a larger dataset given greater computational resources. We further partition the 4000 observations into a train set of 3000 observations and a test set of 1000 observations. All training and tuning are performed on the training data, and model performance is assessed on the held-out test set.

Remark: Different models require different preprocessing for the categorical variables, which we address further in subsequent sections; for example, the glm package can accept factors as a data type, whereas for glmnet they must be converted to the model.matrix format.

Remark: When encoding the categorical variables as dummy variables in our subsequent analysis, we find that there are over 600 unique categories across all 19 of the categorical variables. We could also consider different methods to process categorical variables, such as merging less frequently occurring categories, or target encoding (Parr and Howard 2019, ch. 6). Nonetheless, we stick with the factors-and-dummies approach given our lack of familiarity with these other methods.
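As an illustration of the first alternative, infrequent levels could be merged into a single catch-all category. The helper below (lump_rare is a hypothetical name, run on toy data rather than the competition data) sketches this in base R:

```r
# Sketch: merge factor levels with fewer than `min_n` observations
# into a single "other" level (toy data, base R only).
lump_rare <- function(f, min_n = 5) {
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  # assigning the same label to several levels merges them
  levels(f)[levels(f) %in% rare] <- "other"
  f
}
f <- factor(c(rep("A", 10), rep("B", 6), "C", "D"))
table(lump_rare(f))  # A: 10, B: 6, other: 2
```

A similar effect could be achieved with forcats, but the base R version keeps the dependency footprint unchanged.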

Feature Engineering

A challenge in this case is that we have no specific domain knowledge about the meaning of particular features, or whether they may be useful. Thus, any preprocessing in terms of feature engineering or a priori feature selection may not be that meaningful. Nonetheless, we can resort to model interpretability methods (Molnar 2019) to obtain an ex-post explanation of our models. In the linear case, we can examine coefficients and their statistical significance; for tree-based models, we can use methods such as feature importances, partial dependence plots, and Shapley values.

cat_feats <- 1:19 # the first 19 columns are categorical 
cont_feats <- 20:30 # and the next 11 are continuous
target_col <- 31 # target
# convert cats to factors
df <- column_to_rownames(df, var = "id") %>% 
         mutate_if(is.character,as.factor) %>% 
         mutate_at(vars(target), factor)
# subsample the data for faster model inference
set.seed(1)
n_train <- 4000
df_sample = df[sample(1:nrow(df), n_train),]
# Partition data into train and test; test will be our oos data
set.seed(1)
df_split <- initial_split(df_sample, prop = 3/4)
df_train <- training(df_split)
df_test <- testing(df_split)

Exploratory Data Analysis

We first conduct some exploratory data analysis by visualising the univariate and bivariate relationships in the dataset. This allows us to gain some understanding of the data and potential modelling approaches.

Univariate EDA

We observe that the univariate distributions of the continuous variables are all multi-modal and non-normal, but they are all normalised (scaled) to the range of \([0, 1]\).

# flatten df into using pivot_longer and plot distribution
df %>% pivot_longer(cols = starts_with("cont"), names_to  = "cont") %>% 
   ggplot(aes(x = value))+
   geom_histogram(bins = 100, alpha = 0.85)+
   ggtitle("Continuous features distribution")+
   facet_wrap(cont~., scales = "free") +
   theme_minimal()

From the distributions of categorical variables, we see that there are variables with substantially more observations in one category, and also variables with a high number (\(>50\)) of categories, which will be an issue to address in any subsequent preprocessing.

# flatten df into using pivot_longer and plot distribution
df %>% pivot_longer(cols = contains(c("cat", "target")), names_to  = "cat") %>% 
   ggplot(aes(x = value))+
   geom_bar(alpha = 0.85)+
   ggtitle("Categorical features distribution")+
   facet_wrap(cat~., scales = "free", ncol = 4) +
   theme_minimal(base_size = 30)

Bivariate EDA

We group the continuous variables by the target and plot them as boxplots to check for any obvious differences discernible by eye. From the plots, claims with target == 1 have lower values of cont3 on average (median) than claims with target == 0. Claims with target == 1 also have a much lower median value of cont4 than claims with target == 0. Hence, we would expect cont3 and cont4 to have negative relationships with target. In addition, we see a group of observations at the tail ends for cont0, cont5, cont7, cont8, cont9, cont10, which may be indicative of outliers.

# flatten df into using pivot_longer
# group by target and plot distribution
df[, c(cont_feats, target_col)] %>% 
  pivot_longer(cols = starts_with("cont"), names_to  = "var", values_to="value") %>% 
  ggplot(aes(x = target, y = value)) + 
  geom_boxplot() + coord_flip() + facet_wrap(~var, scales="free_x")

From our conditional boxplots, we can identify potential outliers by filtering observations that lie at the extreme percentiles of those continuous variables. In particular, we note that there are many values at the extreme quantiles for the variables cont0, cont5, cont7, cont8, cont9, cont10. We investigate this further using the Hampel filter, which flags points lying outside the median plus or minus 3 median absolute deviations (MAD) as outliers. Through this approach, more than 200 observations per variable would be identified as outliers, which suggests that these may not be outliers but that the distributions are simply heavy-tailed. Without further information on the reasonable scale of values that individual variables can take (along with the fact that they are all normalised), we find it challenging to identify outliers through descriptive statistics and decide not to exclude any such points using this approach. We will detect outliers using another approach in subsequent modelling.

# MAD filter
hampel_filter <- function(df){
   lower_bound <- median(df) - 3 * mad(df, constant = 1)
   upper_bound <- median(df) + 3 * mad(df, constant = 1)
   outlier_ind <- which(df < lower_bound | df > upper_bound)
   return(outlier_ind)
}
# quantile filter
percentile_filter <- function(df, lq = 0.001, uq = 0.999){
   lower_bound <- quantile(df, lq)
   upper_bound <- quantile(df, uq)
   outlier_ind <- which(df < lower_bound | df > upper_bound)
   return(outlier_ind)
}
# outlier count
hampel_count <- function(x){length(hampel_filter(x))}
pct_count <- function(x){length(percentile_filter(x))}
outlier_counts <- df_train[, cont_feats] %>% map_dfr(hampel_count)
outlier_counts[2, ] <- df_train[, cont_feats] %>% map_dfr(pct_count)
outlier_counts

For categorical variables, we use stacked bar plots to show the percentages of observations in each category that correspond to target == 0 and target == 1 respectively. A much larger proportion of claims with cat13 == B appear to be associated with target == 1 compared to cat13 == A. On the other hand, a much larger percentage of claims correspond to target == 0 if cat18 is A or B, than if cat18 is C or D.

# flatten df into using pivot_longer
# group by target and plot distribution
df[, c(cat_feats, target_col)] %>% 
  pivot_longer(cols = starts_with("cat"), names_to  = "cat", values_to="value") %>% 
  ggplot(aes(x = value, fill=target)) + 
    geom_bar(position="fill") + 
    scale_y_continuous(name = "Within group Percentage", labels = scales::percent) +
    facet_wrap(~cat, scales="free_x", ncol = 4) +
    theme_minimal(base_size = 40)

We also inspect the correlation matrix of our continuous variables. There appears to be a cluster of variables cont1, cont2, cont8 that are highly correlated with each other. This may be indicative of multicollinearity, a concern when using linear models. In addition, we also consider the (Pearson) correlation between our continuous variables and the target, although it is not necessarily meaningful in this case given our target is binary.

We note that the continuous variables have Pearson correlations of roughly \([-0.2, 0.2]\), suggesting that there is some small signal or information between the features and the target. However, the Pearson correlation only describes a linear and pairwise association between the variables. As such there could also be complex non-linear associations and interactions between the variables, which the presence of multi-modal distributions of the continuous features in our univariate EDA may also affirm.
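A toy illustration of this limitation: Pearson correlation can be essentially zero even when one variable is a deterministic, non-linear function of the other.

```r
# Pearson correlation only captures linear association: here y is fully
# determined by x, yet the correlation is 0 (up to floating point) because
# the relationship is symmetric about x = 0.
x <- seq(-1, 1, length.out = 101)
y <- x^2
cor(x, y)  # ~0
```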

#heatmap, high values in red, low in yellow
cor_matrix <- cor(df[, cont_feats])
heatmap(cor_matrix, main="Correlation Matrix (Clustered)")

cor_with_target <- rownames_to_column(data.frame(cor(df[, c(cont_feats)], as.numeric(df$target))))
names(cor_with_target) <- c("feature", "pearson_corr")
cor_with_target %>%
  ggplot(aes(x=reorder(feature, -pearson_corr), y = pearson_corr)) + 
  geom_bar(stat='identity') + coord_flip() + ggtitle("Pearson Correlation of Continuous with Target")

We can use Principal Components Analysis (PCA) to visualise the data in low dimension and examine whether there are any explicitly discernible trends. We apply PCA to all the continuous features; no further pre-scaling is required given the continuous features take values in \([0, 1]\). By eye, there does not appear to be any significant difference in the PCA representations for each class, although there are many observations with target == 1 closer to the bottom of the ellipsoid formed by the first two principal components. At least in the space formed by the first two principal components, the classes do not appear to be linearly separable, which suggests a non-linear method may be more effective on this classification task.

On examination of the cumulative explained variance ratio, the variation in \(\mathbf{X}\) captured by the first \(k\) principal components, we find that the first two principal components capture only \(\approx 60\%\) of the variation in the continuous variables. To capture \(\approx 90\%\) of the variation we would need 6 principal components, and 9 to capture nearly all of it. This suggests that including more features could be beneficial, which would be a concern if we are to do feature selection.

# pca loadings
pcs <- prcomp(df[,cont_feats])
set.seed(2021)
data.frame(pc1=pcs$x[,1], pc2=pcs$x[,2], target=df[, "target"]) %>%
ggplot(aes(x = pc1, y = pc2, colour = target)) + 
  geom_jitter(alpha=0.7) + ggtitle('Principal Components')

# cumulative explained variance ratio
cumul_var <- cumsum(pcs$sdev^2 / sum(pcs$sdev^2))
ggplot(data.frame(feature = 1:11, cumul_var = cumul_var)) + 
  geom_line(aes(x = feature,y = cumul_var)) + ggtitle("Cumulative Explained Variance Ratio") +
  scale_y_continuous(labels=scales::percent) + xlab("Number of Principal Components") + 
  ylab("Cumulative explained Variance Ratio")

Modelling

We first consider a naive model that always predicts a single label (in this case the greater accuracy is obtained by always predicting \(0\), given it is the most frequently occurring class). The accuracy in this case, the proportion of correct labels \(\frac{1}{n}\sum_{i = 1}^{n} I(y_{i} = \hat{y_{i}})\), would be \(1 - \bar{y} = 0.745\). This illustrates a potential issue with the accuracy metric: a “high” accuracy may reflect the distribution of the target rather than the model performance itself, so when comparing model accuracies, an appropriate baseline is needed.
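As a minimal sketch (on hypothetical labels, not our dataset), the majority-class accuracy is simply the proportion of the majority class:

```r
# The accuracy of always predicting the majority class equals that
# class's share of the data (toy imbalanced labels).
y <- c(rep(0, 745), rep(1, 255))    # 74.5% zeros
naive_pred <- rep(0, length(y))     # always predict the majority class
naive_acc <- mean(naive_pred == y)  # 0.745
```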

We also consider the ROC-AUC metric (Receiver Operating Characteristic Area Under the Curve). Given the problem is one related to insurance, the modelling outcome of interest is not only to obtain correct predictions (accuracy), but also accurate probabilities, which the AUC metric captures to some extent.

The AUC metric calculates the area under the ROC curve, the plot of the true positive rate or sensitivity \(\frac{TP}{TP + FN}\) against the false positive rate \(\frac{FP}{FP + TN}\), assessing the performance of the classifier’s predicted probabilities across all possible decision thresholds. A classifier that always predicts 0s would result in a baseline AUC of \(0.5\).
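A minimal from-scratch AUC (auc_mw below is a hypothetical helper, not part of pROC) uses the equivalent Mann-Whitney formulation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counting one half.

```r
# AUC via the Mann-Whitney statistic: for every (positive, negative) pair,
# score 1 if the positive is ranked higher, 0.5 if tied, 0 otherwise.
auc_mw <- function(y, p) {
  pos <- p[y == 1]
  neg <- p[y == 0]
  wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(wins)
}
auc_mw(c(0, 0, 1, 1), c(0.2, 0.2, 0.2, 0.2))  # 0.5: a constant classifier
auc_mw(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))  # 1: a perfect ranking
```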

As we do not know the specific threshold relevant to the insurance context of the problem, we will report both accuracy, with a decision threshold set to \(> 0.5\), and the AUC score for all models considered. In practice, the decision threshold would largely depend on the requirements of the modeller.

# train-test
X_train = as.matrix(df_train[, cont_feats])
y_train = as.numeric(as.matrix(df_train$target))
X_test = as.matrix(df_test[, cont_feats])
y_test = as.numeric(as.matrix(df_test$target))
# helper function for metrics
evaluate_metrics <- function(train_pred, test_pred, train_true, test_true){
  train_classes <- ifelse(train_pred > 0.5, 1,0)
  test_classes <- ifelse(test_pred > 0.5, 1,0)
  acc1 <- mean(train_classes == train_true)
  auc1 <- auc(roc(train_true, train_pred, quiet=TRUE))
  acc2 <- mean(test_classes == test_true)
  auc2 <- auc(roc(test_true, test_pred, quiet=TRUE))
  data.frame(train_acc=acc1, train_auc=auc1, test_acc = acc2, test_auc=auc2)
}
# metrics for naive prediction
results = data.frame(evaluate_metrics(rep(0, dim(df_train)[1]), 
                               rep(0, dim(df_test)[1]), 
                               df_train$target, df_test$target),
                     row.names=c("naive"))
results

Logistic Regression (SGD)

To fulfil the project requirements, we implement a ‘from scratch’ Stochastic Gradient Descent routine for Logistic Regression. The derivation follows (Hastie et al. 2009, 120–26).

Consider a matrix of variables \(X = (x_{1}, x_{2}, \ldots x_{n})^{T}\), where \(x_{i}\) denotes the i-th row or observation, and the target \(\mathbf{y} = (y_{1}, y_{2}, \ldots, y_{n})\). Let \(\frac{p(x_{i}; \beta)}{1 - p(x_{i} ; \beta)} = \exp(\beta^{T} x_{i})\) denote the odds.

We are interested in finding the coefficients \(\beta\), such that the logistic loss (also referred to as binary cross-entropy, and equivalent to the negative log-likelihood) is minimised. The logistic loss is given by:

\[\begin{align}l(\boldsymbol{\beta}) &= -\sum_{i = 1}^{N} \left[ y_{i} \log(p(x_{i} ; \boldsymbol{\beta})) + (1 - y_{i}) \log(1 - p(x_{i} ; \boldsymbol{\beta})) \right]\\ &= -\sum_{i = 1}^{N} \left [ y_{i}\log \left (\frac{p(x_{i} ; \boldsymbol{\beta})}{1 - p(x_{i} ; \boldsymbol{\beta})} \right) + \log(1 - p(x_{i} ; \boldsymbol{\beta})) \right ]\\ &= -\sum_{i = 1}^{N}\left [y_{i} \boldsymbol{\beta}^{T} x_{i} - \log(1 + \exp(\boldsymbol{\beta}^{T}x_{i})) \right ] \tag{1} \end{align}\]

With the inclusion of a regularisation term, specifically a \(L^{2}\) penalty, the loss function becomes:

\[ \begin{equation}l(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [y_{i} \boldsymbol{\beta}^{T} x_{i} - \log(1 + \exp(\boldsymbol{\beta}^{T}x_{i})) \right ] + \lambda \boldsymbol{\beta}^{T} \boldsymbol{\beta} \tag{2} \end{equation}\]

The gradient of the loss function with respect to the coefficients \(\mathbf{\beta}\) is given by:

\[\nabla l(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [ y_{i} x_{i} - \frac{x_{i}\exp(\boldsymbol{\beta}^{T} x_{i})}{1 + \exp(\boldsymbol{\beta}^{T} x_{i})} \right] + 2\lambda \boldsymbol{\beta}\]

We apply (stochastic) gradient descent. In short, this relies on updating the coefficients \(\beta\) iteratively based on a step size \(\eta\) (distinct from the regularisation strength \(\lambda\)):

\[\beta_{t + 1} = \beta_{t} - \eta \nabla l(\boldsymbol{\beta}_{t})\]

More sophisticated approaches such as Momentum or Adam can be used for the iterative updates; these are outlined in (Vishnoi 2020). In our simple implementation, we use the Barzilai-Borwein method (Murphy 2012, 444–45) to determine the step size.

\[\eta_{t} = \frac{|(\beta_{t} - \beta_{t - 1})^{T}(\nabla l(\beta_{t}) - \nabla l(\beta_{t - 1}))|}{\| \nabla l(\beta_{t}) - \nabla l(\beta_{t - 1})\|^{2}}\] Given sufficient iterations and an appropriate step size, the model should converge (i.e. the change in the coefficients \(\|\beta_{t + 1} - \beta_{t}\|\) or the change in the loss falls below some threshold \(\epsilon\)), although this is not guaranteed.

# binary cross-entropy / log-loss with L2 penalty, averaged over observations
log_loss <- function(x, y, betas, lambda){
  logits <- x %*% betas
  (sum(log(1 + exp(logits))) - t(y) %*% logits + lambda * t(betas) %*% betas) / dim(x)[1]
}
# gradient of the penalised log-loss, averaged over observations
gradients <- function(x, y, betas, lambda){
  logits <- x %*% betas
  (- t(x) %*% (y - exp(logits)/(1 + exp(logits))) + lambda * 2 * betas) / dim(x)[1]
}
# pre set parameters
p = dim(X_train)[2]
lambda = 0
n_iters <- 100
init_step_size <- 1e-6
set.seed(2021)
beta_init <- matrix(rnorm(p),nrow=p)
beta_path <- matrix(rep(0, n_iters * p), nrow = n_iters, ncol=p)
beta_path[1,] = beta_init
last_grad <- grad <- gradients(X_train, y_train, beta_path[1,], lambda)
beta_path[2,] = beta_init - init_step_size * grad
grad <- gradients(X_train, y_train, beta_path[2,], lambda)
losses <- rep(0, n_iters)
# iteratively update betas
for (i in 3:n_iters){
    step_size <- as.numeric(t(beta_path[i - 1,] - beta_path[i - 2,]) %*% (grad - last_grad) / 
                    (t(grad - last_grad) %*% (grad - last_grad)))
    beta_path[i,] <- beta_path[i - 1,] - step_size * grad
    last_grad <- grad
    grad <- gradients(X_train, y_train, beta_path[i, ], lambda)
    losses[i] <- log_loss(X_train, y_train, beta_path[i,], lambda)
}
# plot + results
ggplot(data.frame(step = 3:n_iters, loss=losses[3:n_iters])) + 
  geom_line(aes(x = step, y = loss)) +
  ggtitle("Binary Crossentropy vs. Iterations")

# predictions from the final iterate
pred_train <- as.numeric(1 / (1 + exp(-X_train %*% beta_path[n_iters,])))
pred_test <- as.numeric(1 / (1 + exp(-X_test %*% beta_path[n_iters,])))
results["sgd",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["sgd",]

We apply our “from scratch” implementation of Logistic Regression using SGD with all the numerical variables, obtaining a test accuracy of \(0.751\) and a test AUC of \(0.7082\), a small improvement in terms of accuracy from the naive strategy, but a larger improvement in terms of AUC.

Remark: We note that our implementation of Logistic Regression is quite limited, for example an inability to process the factor datatype (unless we append a one-hot encoded matrix), or an inability to examine coefficient standard errors (unless we change to a Hessian based approach / Newton’s method). Given the limitations of our implementation, we shall subsequently use the glm package for logistic regression, and the glmnet package for regularised logistic regression.
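To illustrate the one-hot encoding route mentioned above, base R's model.matrix() expands factors into dummy columns (toy data; removing the intercept with -1 keeps a column for every level of the first factor):

```r
# One-hot encoding of a factor via the model formula interface.
toy <- data.frame(cat = factor(c("A", "B", "C", "A")), x = 1:4)
X <- model.matrix(~ cat + x - 1, data = toy)
colnames(X)  # "catA" "catB" "catC" "x"
```

This matrix could be column-bound to the continuous features and passed to our SGD routine unchanged.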

Logistic Regression (Few Predictors)

Having explored the efficacy of logistic regression on this modelling task, we now utilise a more robust implementation of logistic regression via the glm package. We consider a more parsimonious logistic regression model, hand-picking several features based on our exploratory data analysis. We select the features that have discernible differences in their distributions conditioned on the target: the predictors cont3, cont4, cat13, cat18.

By selecting few predictors, coupled with the in-built features of the glm implementation, we can obtain, to some extent, interpretability and understanding of the model. This will help guide our analysis, and identify any potential issues before expanding to more predictors or different models. We will also use this as a baseline to compare the performance of subsequent models with.

glm_base <- glm(target~cont3+cont4+cat13+cat18, data = df_train, family=binomial(link="logit"))
summary(glm_base)

Call:
glm(formula = target ~ cont3 + cont4 + cat13 + cat18, family = binomial(link = "logit"), 
    data = df_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5707  -0.6702  -0.6021   0.3931   2.0095  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.4821     1.0756  -1.378  0.16824    
cont3        -0.6554     0.2276  -2.880  0.00398 ** 
cont4        -0.4772     0.2063  -2.313  0.02070 *  
cat13B        1.8681     0.3346   5.584 2.36e-08 ***
cat18B        0.5581     1.0714   0.521  0.60243    
cat18C        2.6081     1.0816   2.411  0.01590 *  
cat18D        2.4342     1.0805   2.253  0.02427 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3406.6  on 2999  degrees of freedom
Residual deviance: 2962.8  on 2993  degrees of freedom
AIC: 2976.8

Number of Fisher Scoring iterations: 4
vif(glm_base)
          GVIF Df GVIF^(1/(2*Df))
cont3 1.110907  1        1.053996
cont4 1.053938  1        1.026615
cat13 1.010521  1        1.005247
cat18 1.066127  3        1.010729
ggcoef(glm_base)

Firstly, the predictors are almost all statistically significant at the 5% level, with the exception of cat18B and the intercept. Inspecting the coefficient plots, which visualise the confidence intervals of each coefficient estimate, we observe that the confidence intervals of all variables except cat18B and the intercept exclude 0. We also see that the confidence intervals of the coefficients of cat18D and cat18C, whose widths are proportional to their standard errors, are much wider than those of cat13B, cont4 and cont3; however, as the categorical and continuous variables are on completely different scales, these are not directly comparable. This could be a result of subsampling, and as we increase the sample size the standard errors of the coefficients would be expected to decrease.

Secondly, we inspect the generalised variance inflation factor (GVIF). The variance inflation factor captures the extent to which the variance of a coefficient estimate is inflated due to correlation with other predictors in the model; the generalised version corrects for the number of degrees of freedom corresponding to the number of levels in the categorical variables. The GVIF values of all predictors in our model are less than 2, which suggests no evidence of multicollinearity. Nonetheless, a quick inspection of the variable cat18 indicates that there are very few observations in the level cat18A; thus cat18B would be approximately equal to 1 - cat18C - cat18D and potentially collinear once the intercept is considered. This suggests that a more robust method to encode the categorical variables could be required, such as merging infrequent categories, or perhaps using regularisation instead.

table(df_train$cat18)

   A    B    C    D 
   8 2584  194  214 

Assuming that there is no multicollinearity, the coefficient for each predictor represents the association of that predictor with the log-odds of having target == 1: \(\log \left ( \frac{P(\text{target}=1)}{P\text{(target}=0)} \right )\).

The coefficients of the dummy variables indicate the average difference between the log odds of that factor level group compared to the baseline level group. In our regression, the first category (A) of each categorical variable is taken as the baseline group. For example, the coefficient of cat13B implies the following equation: \[\log\frac{P(target=1|cat13=B)}{P(target=0|cat13=B)}-\log\frac{P(target=1|cat13=A)}{P(target=0|cat13=A)}=1.8681\]

This suggests that claims with cat13=B have odds of target == 1 approximately \(exp(1.8681) \approx 6.48\) times those of claims with cat13=A (i.e. \(548\%\) higher). Given that the coefficient of cat18B is not statistically significant (in other words, its standard error is high relative to the estimate), its association with target may not be significantly different from cat18=A (inspecting the conditional plot from our exploratory data analysis, their conditional distributions appear similar). However, having cat18=C as opposed to cat18=A is associated with odds of target == 1 that are \(exp(2.6081) \approx 13.6\) times higher. Similarly, claims with cat18=D have odds \(exp(2.4342) \approx 11.4\) times those of claims with cat18=A.

To interpret the coefficients of the continuous variables, note that the non-linearity of the logistic function means a one unit change in a predictor does not correspond to the same change in probability across the range of the predictor; the coefficients instead act multiplicatively on the odds. cont3 and cont4 have negative coefficients as expected, and are interpreted as follows. A one unit increase in cont3 is associated with an \(exp(-0.6554)=0.519\) multiplicative effect on the odds, i.e. a 48% decrease compared to the previous odds. Similarly, a claim with a one unit higher value of cont4 has \(1 - exp(-0.4772)=37.9\%\) lower odds of target == 1 than a claim with a one unit lower value of cont4. As the true range of each continuous variable before normalisation is unknown, it is unclear whether these associated changes in target are considered large in reality.
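The odds-ratio arithmetic used in the two paragraphs above can be checked directly:

```r
# Worked arithmetic for the odds-ratio interpretations of the glm coefficients.
exp(1.8681)       # ~6.48: cat13=B multiplies the odds by ~6.5 (548% higher)
exp(-0.6554)      # ~0.519: a one unit rise in cont3 roughly halves the odds
1 - exp(-0.4772)  # ~0.379: ~37.9% lower odds per unit of cont4
```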

pred_train <- predict(glm_base, df_train, type="response")
pred_test <- predict(glm_base, df_test, type="response")
results["glm-base",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-base",]

With just a few predictors, the baseline model has a test accuracy of \(80.3\%\), which is \(7.5\) percentage points higher than that of the naive model. The test AUC of the baseline model is \(0.703\), which is \(0.203\) higher than the naive model's \(0.5\), a fair improvement.

Logistic Regression (All Predictors)

We now run a logistic regression using all 30 predictors. The regression coefficients can be interpreted in the same way as for the baseline model. The regression output shows that all predictors are significant at the 5% level, but this is unreliable since the categorical variables have many levels (623 in total).

glm_full <- glm(target~., data=df_train, family=binomial(link="logit"), 
            control = list(maxit = 100))
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
pred_train <- predict(glm_full, df_train, type="response")
prediction from a rank-deficient fit may be misleading
# register all factor levels so predict() can handle levels unseen in training
glm_full$xlevels = lapply(df[,cat_feats], levels)
pred_test <- predict(glm_full, df_test, type="response")
prediction from a rank-deficient fit may be misleading
results["glm-full",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-full",]

The full model has a higher train accuracy (0.801) but lower test accuracy (0.59) than the baseline model, which implies overfitting. The test AUC is 0.543, which is much lower than the baseline. This is an example of the bias-variance trade-off; the more complex full model with 30 predictors has lower bias but higher variance than the simpler baseline model with 4 predictors.

We also obtain a warning, “prediction from a rank-deficient fit may be misleading”, which suggests multicollinearity. We aim to overcome this and improve the stability of our model using shrinkage in a subsequent section.
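Rank deficiency can be checked directly by comparing the rank of the model matrix against its number of columns; a toy sketch with a deliberately collinear column:

```r
# A rank-deficient design: column b is an exact multiple of column a,
# so the model matrix (intercept, a, b) has rank 2 < 3 columns.
toy <- data.frame(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8))  # b = 2a
X <- model.matrix(~ a + b, data = toy)
qr(X)$rank < ncol(X)  # TRUE: one column is linearly redundant
```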

Outlier Detection

At this stage, we suspect that some outliers might be heavily influencing the performance of our models, and hence we seek to detect them using a few measures. Formally, outliers are observations whose response is unusual conditional on the covariates (predictors). Firstly, we look for points with large (studentised) residuals. We can then test whether these residuals are significantly larger than those of other points by examining the Bonferroni-adjusted p-values. Ten points are identified as having large studentised residuals with adjusted p-values less than 0.05. We store the indices of these points to be removed later.

outlierTest(glm_full)
outliers <- as.numeric(names(outlierTest(glm_full)$p))

Observations that are far from the average covariate pattern are considered to have high leverage, which can be measured using hat values. Here, there are at least 100 points with high leverage. Finally, we also measure influence: an influential observation is one that is both an outlier and has high leverage, and is likely to distort the regression coefficients; influence can be thought of as the product of leverage and outlyingness. Here, we plot studentised residuals against hat values, with the size of each circle proportional to the observation's Cook's distance, a measure of influence.

car::influencePlot(glm_full)

We observe that there are 4 observations with high influence. We remove these observations along with the points identified as outliers earlier (14 in total), and compare the performance of our updated model with the original model (see Appendix).

We find that removing outliers (see Appendix) leads to a higher test accuracy of 0.684 but a lower test AUC of 0.539. We cannot interpret this further without knowing the specific threshold used to classify the target in the data. Analysing the regression output, we see that leaving out the outliers does not change the coefficient estimates, and since we have no intuitive reason to believe that these outlier values are extreme (due to no knowledge about the variables themselves), we decided to keep these outlier data points for all other models. Furthermore, it is possible that these outliers may be less impactful as we increase the sample size.

Regularised Logistic Regression

Given the issue of high dimensionality, we consider a regularised form of logistic regression. Specifically, we consider ridge (logistic) regression, with the functional form as in equation (2). This method shrinks coefficients by imposing a penalty on their size, thereby reducing model complexity. We use the implementation offered by the glmnet package, which requires the data to be in a matrix format, and hence we make the corresponding conversion below.

# glmnet requires numeric matrices; model.matrix() dummy-encodes the factors
X_train <- model.matrix(~ ., df_train[, -length(df_train)])
X_test  <- model.matrix(~ ., df_test[, -length(df_test)])
y_train <- df_train$target
y_test  <- df_test$target
# alpha = 0 gives the ridge penalty; lambda is chosen by cross-validation
glm_reg <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 0)
pred_train <- as.numeric(predict(glm_reg, X_train, type = "response"))
pred_test  <- as.numeric(predict(glm_reg, X_test, type = "response"))
results["glm-ridge", ] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-ridge", ]

Random Forest

To improve performance, we draw on non-linear tree-based models, specifically random forests (Leo Breiman 2001). Intuitively, a random forest averages many decision trees, each fit to a bootstrap sample of the data (bagging) and using a random subset of features at each split, so as to reduce the variance of individual trees. This reduces the likelihood of overfitting to the training data.
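The variance-reduction argument can be made precise. If each of $B$ identically distributed trees has variance $\sigma^{2}$ and pairwise correlation $\rho$, the variance of their average is

$$\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_{b}(x)\right) = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2},$$

so increasing $B$ shrinks the second term, while the random feature subsets at each split act to lower the correlation $\rho$ in the first (Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome 2009, ch. 15).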

rf_recipe <- recipe(target~., data = df_train)

rf_model <- 
  rand_forest(trees=100) %>%
  set_engine("ranger", importance="impurity", seed=2021) %>%
  set_mode("classification")

rf_workflow <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_model)

rf1 <- fit(rf_workflow, df_train)
ranger_obj <- pull_workflow_fit(rf1)$fit
# plot feature importances
rf_feat_imp <- rownames_to_column(data.frame(ranger_obj$variable.importance));
names(rf_feat_imp) <- c("feat", "importance")
ggplot(rf_feat_imp, aes(x = reorder(feat, importance), y = importance))+ 
      geom_bar(stat="identity", position="dodge")+ coord_flip()+
      ylab("Feature Importance (Gini Impurity)")+
      xlab("Feature")+
      ggtitle("Random Forest Feature Importances")
#evaluate metrics
pred_train <- unlist(predict(rf1,df_train,type='prob')[,2])
pred_test <- unlist(predict(rf1, df_test, type='prob')[,2])
results["rf",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["rf",]
temp <- function(x){partial(ranger_obj, train=df_train, pred.var = x, plot = TRUE, plot.engine = "ggplot2", paropts = list(.packages = "ranger"))}
plots <- lapply(names(df_train)[cont_feats], temp)
wrap_plots(plots)

The random forest achieves a test accuracy of 0.836 and a test AUC of 0.872, higher than the previous linear models. However, the train accuracy and AUC are significantly higher than the test performance, which suggests some degree of overfitting to the training data.

Random forests can also be used to rank the importance of different features. Since we fit the forest with impurity-based importance, the axis reports the mean decrease in Gini impurity: the larger the total decrease in impurity attributable to splits on a variable, the more important that variable is. Note that when categorical variables are dummy-encoded, each level of a category appears as a separate variable. In this plot, we show the 30 most important variables.

We then fit another random forest using only the most important features. In doing so, we hope to reduce overfitting by reducing the complexity of the model. However, it does not make sense to drop some levels of a categorical variable while keeping others; hence, whenever any level appears in the top 30 features, we retain the entire variable in our updated random forest model. This leaves us with 22 of the original 30 variables.
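The reduced fit can be sketched as follows, assuming `rf_feat_imp`, `rf_model` and `df_train` from the earlier chunk; the base-name extraction assumes dummy-level names of the form `cat3_B` and is illustrative only:

```r
# take the 30 most important (dummy-encoded) features
top_feats <- head(rf_feat_imp[order(-rf_feat_imp$importance), "feat"], 30)
# map each dummy level back to its parent variable, e.g. "cat3_B" -> "cat3";
# continuous names such as "cont5" pass through unchanged
base_names <- unique(sub("_.*$", "", top_feats))
keep <- intersect(names(df_train), c(base_names, "target"))
# refit the same model specification on the reduced feature set
rf_reduced <- workflow() %>%
  add_recipe(recipe(target ~ ., data = df_train[, keep])) %>%
  add_model(rf_model) %>%
  fit(df_train[, keep])
```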

Our reduced random forest has a slightly higher test accuracy and a slightly lower test AUC. However, it does not resolve the potential overfitting, as train performance remains significantly better than test performance. In fact, train performance of the reduced forest exceeds that of the full forest: with the less informative features removed, the trees can fit the training data even more closely.

XGBoost

For completeness, we also consider the xgboost library for gradient boosted decision trees. The XGBoost package, introduced in (Chen, Tianqi and Guestrin, Carlos 2016), is a variant of gradient boosted decision trees (Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome 2009, 353–74). In gradient boosting, each decision tree is trained sequentially on the residuals of the current ensemble, which has the effect of reducing bias.
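Formally, at stage $m$ boosting fits a tree $h_{m}$ to the negative gradient of the loss evaluated at the current ensemble (for squared-error loss these are exactly the residuals), and adds it with a learning rate $\nu$:

$$F_{m}(x) = F_{m-1}(x) + \nu\, h_{m}(x), \qquad h_{m} \approx -\left.\frac{\partial L\left(y, F(x)\right)}{\partial F(x)}\right|_{F = F_{m-1}}$$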

# fit the dummy encoder on the training data only and apply it to both sets,
# so that train and test share the same columns
dmy <- dummyVars("~.", data = df_train[,-length(df_train)])
X_train <- as.matrix(data.frame(predict(dmy, df_train)))
X_test <- as.matrix(data.frame(predict(dmy, df_test)))
y_train <- as.integer(as.character(df_train$target))
y_test <- as.integer(as.character(df_test$target))
bst <- xgboost(data = X_train, label=y_train, max_depth = 2, nround = 10, 
               verbose=0,
               objective='binary:logistic',
               eval_metric="logloss")

pred_train <- predict(bst, X_train, type="response")
pred_test <- predict(bst, X_test, type="response")
results["xgb",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["xgb",]

importance_matrix <- xgb.importance(model=bst)
xgb.plot.importance(importance_matrix)

# shapley values  
xgboost::xgb.ggplot.shap.summary(X_test, model = bst, 
                                 target_class = 1, top_n = 20)  # Summary plot

CatBoost

We also briefly examine the catboost library for gradient boosted decision trees, given its popularity in machine learning competitions such as Kaggle. The CatBoost library has the advantage of learning a target encoding, i.e. a numerical value for every category of each categorical variable. From an implementation point of view, this may reduce the preprocessing required. For details on CatBoost, refer to (Prokhorenkova, Liudmila and Gusev, Gleb and Vorobev, Aleksandr and Dorogush, Anna Veronika and Gulin, Andrey 2017).
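As a simplified illustration (CatBoost's *ordered* variant is more involved; see the paper), a basic greedy target encoding replaces a category value $v$ with a smoothed mean of the target over observations in that category:

$$\mathrm{enc}(v) = \frac{\sum_{i\,:\,x_{i} = v} y_{i} + a\,p}{\#\{i : x_{i} = v\} + a},$$

where $p$ is a prior (e.g. the overall mean of $y$) and $a > 0$ is a smoothing weight that shrinks rare categories towards the prior.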

Again, the key parameters are the number of trees (iterations) and the depth of each tree. Due to computational limitations, we select arbitrary parameters iterations = 10 and depth = 8; these could be further fine-tuned using hyperparameter optimisation.

X_train = df_train[,c(cat_feats, cont_feats)]
y_train = as.integer(df_train$target)
X_test = df_test[,c(cat_feats, cont_feats)]
y_test = as.integer(df_test$target)

# catboost expects zero-based column indices for cat_features
# (factor columns are also detected automatically)
pool <- catboost.load_pool(X_train, y_train, cat_features = cat_feats - 1)
model <- catboost.train(pool, params=list(depth = 8, iterations = 10, 
                                          loss_function='Logloss', verbose=0))

pred_train <- catboost.predict(model, catboost.load_pool(X_train), prediction_type = 'Probability')
pred_test <- catboost.predict(model, catboost.load_pool(X_test), prediction_type = 'Probability')

# feature importance
feat_importance <- catboost.get_feature_importance(model, pool)
importances <- data.frame(feat_importance[order(feat_importance, decreasing=FALSE),])
importances$features = rownames(importances)
names(importances) <- c("importance","features")
importances$features <- factor(importances$features, levels = importances$features)
ggplot(importances, aes(x=importance, y=features)) + 
  geom_bar(stat="identity") +
  ggtitle("CatBoost Feature Importance")


#Shapley Values
data_shap_tree <- catboost.get_feature_importance(model, pool = pool,
                                                  type = "ShapValues")
data_shap_tree <- data.frame(data_shap_tree[, -ncol(data_shap_tree)]) 
names(data_shap_tree) = names(df[, c(cat_feats, cont_feats)])

# ggplot(stack(data_shap_tree), aes(x = ind, y = values)) +
#     geom_point(aes(color = values)) + coord_flip() + 
#     ggtitle("Shapley Values by variable")  + scale_color_viridis_c()

results["catboost",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["catboost",]

We obtain a train accuracy of \(0.882\) and a train AUC of \(0.921\), versus a test accuracy of \(0.836\) and a test AUC of \(0.858\). This suggests that the model may be overfitting for our particular choice of parameters (and also for our subsampled dataset size).

For interpretability, we can use CatBoost’s feature importance. From the CatBoost documentation, CatBoost’s in-built and default feature importance does the following:

For each feature, PredictionValuesChange shows how much on average the prediction changes if the feature value changes. The bigger the value of the importance the bigger on average is the change to the prediction value, if this feature is changed.

We could also set the type of feature importance to Shapley Values, although we omit the plot in this case due to the lack of in-built support for plotting.

Results and Evaluation

results[order(results$test_auc, results$test_acc),]

Although the tree-based methods are less interpretable and to some extent "black box", we can make use of model-interpretability techniques such as feature importances, partial dependence plots and Shapley values. We caveat that our results may not be fully robust, given that we subsampled the original data.

Conclusion

In this project, we demonstrated the use of logistic regression, gradient descent, regularisation, random forests, gradient boosting and hyperparameter tuning to achieve interpretability and model accuracy (in separate cases) in predicting whether an insurance claim occurred. Model ___ achieved the highest AUC, although interpretability was sacrificed. On the other hand, our baseline model with only a few predictors is easy to interpret but has a relatively low AUC of 0.703. In practice, depending on whether a company's goals lean towards prediction or explanation, an appropriate trade-off between model complexity and interpretability has to be chosen.

From the insurer’s profit perspective, understanding which characteristics are highly correlated with large amounts of claims could also be used to inform whether or not a particular insurance application is accepted, even if causality cannot be inferred. However, this might lead to some legal and/or ethical issues. For example, the UK’s 2010 Equality Act makes discriminating on nine protected characteristics, including age, gender, race and religion, illegal (Pattni, 2020). Therefore, proper data governance needs to be maintained and privacy impact assessments need to be conducted before any machine learning models of this kind are deployed in industry (GDPR, 2018).

Bibliography

Chen, Tianqi and Guestrin, Carlos. 2016. “XGBoost: A Scalable Tree Boosting System.” https://arxiv.org/pdf/1603.02754.pdf.
Efron, Bradley and Hastie, Trevor. 2016. Computer Age Statistical Inference. Cambridge University Press.
Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
James, Gareth and Witten, Daniela and Hastie, Trevor and Robert Tibshirani. 2017. An Introduction To Statistical Learning. Springer.
Leo Breiman. 2001. “Random Forests.” https://link.springer.com/content/pdf/10.1023/A:1010933404324.pdf.
Molnar, Christoph. 2019. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Nisheeth K. Vishnoi. 2020. Algorithms for Convex Optimization. https://convex-optimization.github.io/.
Parr, Terence and Howard, Jeremy. 2019. The Mechanics of Machine Learning. https://mlbook.explained.ai/.
Prokhorenkova, Liudmila and Gusev, Gleb and Vorobev, Aleksandr and Dorogush, Anna Veronika and Gulin, Andrey. 2017. “CatBoost: Unbiased Boosting With Categorical Features.” https://arxiv.org/pdf/1706.09516.pdf.

Appendix

Comparison of Coefficients with Outliers Removed

influencers <- as.numeric(rownames(influencePlot(glm_full)))
glm_full_influencers <- update(glm_full, subset = c(-influencers))
glm_full_outliers <- update(glm_full, subset = c(-outliers))
removal_list <- union(outliers, influencers)
glm_full_removed <- update(glm_full, subset = c(-removal_list))
compareCoefs(glm_full, glm_full_influencers, glm_full_outliers, glm_full_removed)
set.seed(2021)
cv_splits <- rsample::vfold_cv(df_train, strata = target, v= 3)
mod <- logistic_reg(penalty = tune(),
                    mixture = tune()) %>%
  set_engine("glmnet")

glmnet_recipe <- recipe(target~.,data = df_train) %>% 
  step_dummy(all_nominal(), -all_outcomes())

glmnet_workflow<- workflow() %>%
  add_recipe(glmnet_recipe) %>%
  add_model(mod)

glmn_set <- parameters(penalty(range = c(-5,1), trans = log10_trans()),
                       mixture())

glmn_grid <- 
  grid_regular(glmn_set, levels = c(7, 5))
ctrl <- control_grid(save_pred = TRUE, verbose = TRUE)

glmn_tune <- 
  tune_grid(glmnet_workflow,
            resamples = cv_splits,
            grid = glmn_grid,
            metrics = metric_set(roc_auc),
            control = ctrl)


best_glmn <- select_best(glmn_tune, metric = "roc_auc")

# finalise the workflow with the tuned penalty/mixture before fitting
glmnet_workflow <- finalize_workflow(glmnet_workflow, best_glmn)
glm_reg <- fit(glmnet_workflow, data = df_train)
xgb_spec <- boost_tree(
  trees = 100, 
  tree_depth = tune(),
  learn_rate = tune()                         ## step size
) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  learn_rate(),
  size = 5
)

xgb_wf <- workflow() %>%
  add_formula(target ~ .) %>%
  add_model(xgb_spec)

set.seed(123)
vb_folds <- vfold_cv(df_train, strata = target)

doParallel::registerDoParallel()

set.seed(234)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = vb_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
best_auc <- select_best(xgb_res, "roc_auc")
final_xgb <- finalize_workflow(
  xgb_wf,
  best_auc
)
library(vip)

final_xgb %>%
  fit(data = df_train) %>%
  pull_workflow_fit() %>%
  vip(geom = "point")

final_res <- last_fit(final_xgb, df_split)

collect_metrics(final_res)

xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, tree_depth, learn_rate) %>%
  pivot_longer(tree_depth:learn_rate,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")
"https://curso-r.github.io/treesnip/index.html"
library(treesnip)
---
title: "ST310 Course Project"
author: "Candidate Numbers: 10936, 12836, 14359"
date: "`r format(Sys.time(), '%d/%m/%y')`"
output:
  html_notebook:
    toc: true
    toc_float: true
  html_document:
    toc: true
    toc_float: true
    df_print: paged
    keep_md: true
  pdf_document: default
header-includes:
- \usepackage {hyperref}
- \hypersetup {colorlinks = true, linkcolor = blue, urlcolor = blue}
references:
- id: hastie2009elements
  title: 'The Elements of Statistical Learning: Data Mining, Inference, and Prediction'
  author: "Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome"
  publisher: Springer
  type: book
  issued:
    year: 2009
- id: james2017introduction
  title: "An Introduction To Statistical Learning"
  author: "James, Gareth and Witten, Daniela and Hastie, Trevor and Robert Tibshirani"
  publisher: Springer
  type: book
  issued:
    year: 2017
- id: efron2016computer
  title: "Computer Age Statistical Inference"
  author: "Efron, Bradley and Hastie, Trevor"
  publisher: Cambridge University Press
  type: book
  issued:
    year: 2016
- id: molnar2019
  title: "Interpretable Machine Learning"
  author: "Molnar, Christoph"
  type: book
  issued:
    year: 2019
  URL: "https://christophm.github.io/interpretable-ml-book"
- id: murphy2012machine
  title: "Machine Learning: A Probabilistic Perspective"
  author: "Murphy, Kevin P."
  publisher: MIT Press
  type: book
  issued:
    year: 2012
  URL: "https://christophm.github.io/interpretable-ml-book"
- id: prokhorenkova2017catboost
  title: "CatBoost: Unbiased Boosting With Categorical Features"
  type: article-journal
  author: "Prokhorenkova, Liudmila and Gusev, Gleb and Vorobev, Aleksandr and Dorogush, Anna Veronika and Gulin, Andrey"
  issued:
    year: 2017
  URL: "https://arxiv.org/pdf/1706.09516.pdf"
- id: chen2016xgboost
  title: "XGBoost: A Scalable Tree Boosting System"
  author: "Chen, Tianqi and Guestrin, Carlos"
  type: article-journal
  issued:
    year: 2016
  URL: "https://arxiv.org/pdf/1603.02754.pdf"
- id: parr2019moml
  title: "The Mechanics of Machine Learning"
  author: "Parr, Terence and Howard, Jeremy"
  type: book
  issued:
    year: 2019
  URL: "https://mlbook.explained.ai/"
- id: vishnoi2020optimisation
  title: "Algorithms for Convex Optimization"
  author: "Nisheeth K. Vishnoi"
  type: book
  issued:
    year: 2020
  URL: "https://convex-optimization.github.io/"
- id: breiman2016randomforests
  title: "Random Forests"
  author: Leo Breiman
  issued:
    year: 2001
  URL: "https://link.springer.com/content/pdf/10.1023/A:1010933404324.pdf"
---

# ST310 Course Project

The aim of this project is to demonstrate the application of machine learning techniques discussed in the ST310 Machine Learning module, in turn drawing primarily from the exposition in [@james2017introduction] and [@hastie2009elements] among other texts. We examine the use of both linear techniques, e.g.  Logistic Regression,  and non-linear methods, e.g. Random Forests and Gradient Boosted Decision Trees, for a generic binary classification problem.
 
**Remark**: This report is best viewed as a `.html` file. If executing this notebook as a `.Rmd` file, ensure that all libraries and dependencies are installed and that the chunks are executed in sequential order.

 <!-- This notebook is colour-blind friendly through the use of the default, grayscale, viridis color scales for charts (see [https://ggplot2.tidyverse.org/reference/scale_colour_continuous.html?q=color%20blind#arguments](#https://ggplot2.tidyverse.org/reference/scale_colour_continuous.html?q=color%20blind#arguments)). -->


## Dataset 

We obtain the dataset from the [Kaggle March Tabular Playground](https://www.kaggle.com/c/tabular-playground-series-mar-2021/overview) competition. The data consists of *anonymised features*, which correspond to a binary outcome variable; in other words, the task is a **classification problem**. Although the data is anonymised, the challenge description states:

> The dataset used for this competition is synthetic but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.

We believe that this machine learning task is well-motivated, as predicting the amount of an insurance claim or whether an insurance claim occurred, would have a meaningful business impact in the context of actuarial pricing and risk management.

```{r load-libs, warning=FALSE, message=FALSE}
# generic functions and plotting
library(tidyverse)
library(tidyr)
library(ggplot2) 
library(GGally) 
library(patchwork)
library(broom)
library(data.table)
library(doParallel)
library(foreach)
# modelling
library(caret)
library(tidymodels)
library(glmnet)
library(randomForest)
library(xgboost)
library(catboost)
# metrics
library(car) #outliers
library(pROC)
library(pdp)
# read in data
df <- read.csv(file = "../data/train.csv")
cat(c("The dimensions of the array are :", dim(df)[1], ", ", dim(df)[2]))
```
The dataset consists of 30 features,  19 *categorical variables* and 11 numerical variables, and a binary outcome variable `target`. We will need to process these categorical variables in some manner, which we shall consider in the [subsequent section](#preprocessing).

## Preprocessing
We remove the first column `id`, which is a unique identifier for each observation, and convert the first 19 columns, the categorical variables, into the *factor* data type in R. Given computational limitations (the code is executed on single-CPU laptops), we subsample `n_train = 4000` observations from the complete dataset of 300,000. For our subsequent analysis, we use only this subsample of 4000 observations, although the approach extends naturally to a larger dataset given greater computational resources. We further partition the 4000 observations into a train set of 3000 observations and a test set of 1000 observations. All training and tuning are performed on the training data, and model performance is evaluated on the held-out test set.

**Remark**: Different models require different preprocessing of the categorical variables, which we address in the corresponding sections; for example, the `glm` package accepts *factors* as a data type, whereas for `glmnet` they must be converted via `model.matrix`.

**Remark**: When encoding the categorical variables as dummy variables in our subsequent analysis, we find that there are over 600 unique categories across all 19 of the categorical variables.  We could also consider different methods to process categorical variables, such as merging less frequently occurring categories, or  *target encoding* [@parr2019moml, ch. 6]. Nonetheless, we stick with the factors and dummies approach given our lack of familiarity with these other methods.

### Feature Engineering

A challenge in this case is that we have no *specific* domain knowledge about the meaning of particular features or whether they may be useful. Thus, any preprocessing in terms of feature engineering or a priori feature selection may not be that meaningful. Nonetheless, we can resort to model-interpretability methods [@molnar2019] to obtain an ex-post explanation of our models: in the linear case, we can examine coefficients and their statistical significance, and for tree-based models we can use techniques such as feature importances, partial dependence plots, and Shapley values.

```{r partition}
cat_feats = 1:19 # the first 19 columns are categorical 
cont_feats <- 20:30 # and the next 11 are continuous
target_col <- 31 # target
# convert cats to factors
df <- column_to_rownames(df, var = "id") %>% 
         mutate_if(is.character,as.factor) %>% 
         mutate_at(vars(target), factor)
# subsample the data for faster model inference
set.seed(1)
n_train <- 4000
df_sample = df[sample(1:nrow(df), n_train),]
# Partition data into train and test; test will be our oos data
set.seed(1)
df_split <- initial_split(df_sample, prop = 3/4)
df_train <- training(df_split)
df_test <- testing(df_split)
```

## Exploratory Data Analysis

We first conduct some *exploratory data analysis* by visualising the univariate and bivariate relationships in the dataset. This allows us to gain some understanding of the data and potential modelling approaches.

### Univariate EDA
We observe that the univariate distributions of the *continuous* variables are all multi-modal and non-normal, but they are all normalised (scaled) to the range of $[0, 1]$.
```{r cont-viz}
# flatten df into using pivot_longer and plot distribution
df %>% pivot_longer(cols = starts_with("cont"), names_to  = "cont") %>% 
   ggplot(aes(x = value))+
   geom_histogram(bins = 100, alpha = 0.85)+
   ggtitle("Continuous features distribution")+
   facet_wrap(cont~., scales = "free") +
   theme_minimal()
```
From the distributions of categorical variables, we see that there are variables with substantially more observations in one category, and also variables with a high number ($>50$) of categories, which will be an issue to address in any subsequent preprocessing. 
```{r cat-viz, fig.width=15, fig.height=15}
# flatten df into using pivot_longer and plot distribution
df %>% pivot_longer(cols = contains(c("cat", "target")), names_to  = "cat") %>% 
   ggplot(aes(x = value))+
   geom_bar(alpha = 0.85)+
   ggtitle("Categorical features distribution")+
   facet_wrap(cat~., scales = "free", ncol = 4) +
   theme_minimal(base_size = 30)
```



### Bivariate EDA

We group the continuous variables by the target and plot them as boxplots to check for any obvious differences discernible by eye. From the plots, claims with `target == 1` have a lower median value of `cont3` than claims with `target == 0`, and a much higher median value of `cont4`. Hence, we would expect `cont3` to be negatively, and `cont4` positively, associated with the target. In addition, we see a group of observations at the tail ends of `cont0, cont5, cont7, cont8, cont9, cont10`, which may be indicative of outliers. 

```{r cont-by-target}
# flatten df into using pivot_longer
# group by target and plot distribution
df[, c(cont_feats, target_col)] %>% 
  pivot_longer(cols = starts_with("cont"), names_to  = "var", values_to="value") %>% 
  ggplot(aes(x=target,y=value), fill=factor(value)) + 
  geom_boxplot() + coord_flip() + facet_wrap(~var, scales="free_x")
```

From our conditional boxplots, we can identify potential outliers by filtering observations that lie at the extreme percentiles of those continuous variables. In particular, we note many values at the extreme quantiles of the variables `cont0, cont5, cont7, cont8, cont9, cont10`. We investigate this further using the *Hampel filter*, which flags points lying outside the median plus or minus 3 median absolute deviations (MAD) as outliers. Under this rule, more than 200 observations per variable would be identified as outliers, which suggests that these may not be outliers but that the distributions are simply heavy-tailed. Without further information on the reasonable scale of values that individual variables can take (and given that they are all normalised), we find it challenging to identify outliers through descriptive statistics and decide not to exclude any points using this approach. We detect outliers via a different approach in [subsequent modelling](#outlier-detection).


```{r}
# MAD filter: flag points outside median +/- 3 * MAD
hampel_filter <- function(x){
   lower_bound <- median(x) - 3 * mad(x, constant = 1)
   upper_bound <- median(x) + 3 * mad(x, constant = 1)
   which(x < lower_bound | x > upper_bound)
}
# quantile filter: flag points outside the [lq, uq] percentile range
percentile_filter <- function(x, lq = 0.001, uq = 0.999){
   lower_bound <- quantile(x, lq)
   upper_bound <- quantile(x, uq)
   which(x < lower_bound | x > upper_bound)
}
# outlier count
hampel_count <- function(x){length(hampel_filter(x))}
pct_count <- function(x){length(percentile_filter(x))}
outlier_counts <- df_train[, cont_feats] %>% map_dfr(hampel_count)
outlier_counts[2, ] <- df_train[, cont_feats] %>% map_dfr(pct_count)
outlier_counts
```
For categorical variables, we use stacked bar plots to show the percentages of observations in each category that correspond to `target == 0` and `target == 1` respectively. A much larger proportion of claims with `cat13 == B` appear to be associated with `target == 1` compared to `cat13 == A`. On the other hand, a much larger percentage of claims correspond to `target == 0` if `cat18` is `A` or `B`, than if `cat18` is `C` or `D`.

```{r cat-by-target, warning=FALSE, message=FALSE, fig.width=15, fig.height=15}
# flatten df into using pivot_longer
# group by target and plot distribution
df[, c(cat_feats, target_col)] %>% 
  pivot_longer(cols = starts_with("cat"), names_to  = "cat", values_to="value") %>% 
  ggplot(aes(x = value, fill=target)) + 
    geom_bar(position="fill") + 
    scale_y_continuous(name = "Within group Percentage", labels = scales::percent) +
    facet_wrap(~cat, scales="free_x", ncol = 4) +
    theme_minimal(base_size = 40)
```

We also inspect the correlation matrix for our *continuous* variables. There is a cluster of variables `cont1, cont2, cont8` that are highly correlated with each other, potentially indicative of multicollinearity, a concern when using linear models. In addition, we consider the (Pearson) correlation between the target and our continuous variables, although it is not necessarily meaningful in this case given our target is binary. 

We note that the continuous variables have Pearson correlations of roughly $[-0.2, 0.2]$, suggesting that there is some small signal or *information* between the features and the target. However, the Pearson correlation only describes a *linear* and *pairwise* association between the variables. As such there could also be complex non-linear associations and interactions between the variables, which the presence of multi-modal distributions of the continuous features in our univariate EDA may also affirm.

```{r cor-mat}
#heatmap, high values in red, low in yellow
cor_matrix <- cor(df[, cont_feats])
heatmap(cor_matrix, main="Correlation Matrix (Clustered)")
cor_with_target <- rownames_to_column(data.frame(cor(df[, c(cont_feats)], as.numeric(df$target))))
names(cor_with_target) <- c("feature", "pearson_corr")
cor_with_target %>%
  ggplot(aes(x=reorder(feature, -pearson_corr), y = pearson_corr)) + 
  geom_bar(stat='identity') + coord_flip() + ggtitle("Pearson Correlation of Continuous with Target")
```


We can use Principal Components Analysis (PCA) as a means to visualise the data in low dimension, and examine whether there are any discernible trends. We apply PCA to all the continuous features; no further scaling is required given that they already take values in $[0, 1]$. By eye, there do not appear to be any significant differences between the PCA representations of the two classes, although more observations with `target == 1` lie towards the bottom of the ellipsoid formed by the first two principal components. At least in the space spanned by the first two principal components, the classes do not appear to be linearly separable, which suggests a non-linear method may be more effective on this classification task. 

On examination of the *cumulative explained variance ratio*, the fraction of the variation in $\mathbf{X}$ captured by the first $k$ principal components, we find that the first two components capture only $\approx 60\%$ of the variation in the *continuous variables*. To capture $\approx 90\%$ of the variation we would need 6 components, and for $\approx 95\%$ we would need 9. This suggests that including more features could be beneficial, which would be a concern if we are to do *feature selection*.
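Writing $\lambda_{1} \geq \lambda_{2} \geq \ldots \geq \lambda_{p}$ for the eigenvalues of the sample covariance matrix (the squared singular values `pcs$sdev^2` returned by `prcomp`), the cumulative explained variance ratio after $k$ components is

$$\text{CEVR}(k) = \frac{\sum_{j=1}^{k} \lambda_{j}}{\sum_{j=1}^{p} \lambda_{j}},$$

which is exactly what `cumsum(pcs$sdev^2 / sum(pcs$sdev^2))` computes.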

```{r pca-viz}
# pca loadings
pcs <- prcomp(df[,cont_feats])
set.seed(2021)
data.frame(pc1=pcs$x[,1], pc2=pcs$x[,2], target=df[, "target"]) %>%
ggplot(aes(x = pc1, y = pc2, colour = target)) + 
  geom_jitter(alpha=0.7) + ggtitle('Principal Components')
# cumulative explained variance ratio
cumul_var <- cumsum(pcs$sdev^2 / sum(pcs$sdev^2))
ggplot(data.frame(feature = 1:11, cumul_var = cumul_var)) + 
  geom_line(aes(x = feature,y = cumul_var)) + ggtitle("Cumulative Explained Variance Ratio") +
  scale_y_continuous(labels=percent) + xlab("Number of Principal Components") + 
  ylab("Cumulative explained Variance Ratio")
```

## Modelling

We first consider a naive model that always predicts a single label (here, always predicting $0$, the most frequent class, gives the greater accuracy). The **accuracy**, the proportion of correct labels $\frac{1}{n}\sum_{i = 1}^{n} I(y_{i} = \hat{y}_{i})$, would in this case be $1 - \bar{y} = 0.745$. This illustrates a potential issue with the accuracy metric: a "high" accuracy may reflect the distribution of the target rather than model performance itself, so when comparing model *accuracies*, an appropriate baseline is needed.

We also consider the **ROC-AUC metric** (area under the Receiver Operating Characteristic curve). Given the problem is one related to insurance, the modelling outcome of interest is not only to obtain correct predictions (accuracy), but also accurate probabilities, which the AUC metric captures to some extent. 

The AUC metric calculates the area under the ROC curve, the plot of the true positive rate (*sensitivity*) $\frac{TP}{TP + FN}$ against the false positive rate $\frac{FP}{FP + TN}$, assessing the performance of the classifier's predicted probabilities across all possible decision thresholds. An uninformative classifier, such as one that always predicts the same score, yields a baseline AUC of $0.5$.
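Equivalently, the AUC has a probabilistic interpretation: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted half,

$$\mathrm{AUC} = \Pr\left(\hat{p}(X^{+}) > \hat{p}(X^{-})\right) + \tfrac{1}{2}\Pr\left(\hat{p}(X^{+}) = \hat{p}(X^{-})\right),$$

which is why a constant scorer attains exactly $0.5$.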

As we do not know the specific threshold relevant to the insurance context of the problem, we will report both accuracy, with a decision threshold set to $> 0.5$, and the AUC score for all models considered. In practice, the decision threshold would largely depend on the requirements of the modeller.
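Before relying on `pROC`, it is worth recalling what the AUC measures: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. A small self-contained sketch using simulated scores rather than our data:

```{r auc-rank-sketch}
# AUC via the Mann-Whitney / rank-sum formulation
auc_rank <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  # proportion of (positive, negative) pairs ranked correctly
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
set.seed(2021)
y_sim <- rbinom(200, 1, 0.3)
s_sim <- y_sim + rnorm(200)   # informative but noisy scores
auc_rank(s_sim, y_sim)        # well above the 0.5 baseline
auc_rank(rep(0, 200), y_sim)  # constant scores: all ties, AUC exactly 0.5
```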


```{r}
# train-test
X_train = as.matrix(df_train[, cont_feats])
y_train = as.numeric(as.matrix(df_train$target))
X_test = as.matrix(df_test[, cont_feats])
y_test = as.numeric(as.matrix(df_test$target))
# helper function for metrics
evaluate_metrics <- function(train_pred, test_pred, train_true, test_true){
  train_classes <- ifelse(train_pred > 0.5, 1,0)
  test_classes <- ifelse(test_pred > 0.5, 1,0)
  acc1 <- mean(train_classes == train_true)
  auc1 <- auc(roc(train_true, train_pred, quiet=TRUE))
  acc2 <- mean(test_classes == test_true)
  auc2 <- auc(roc(test_true, test_pred, quiet=TRUE))
  data.frame(train_acc=acc1, train_auc=auc1, test_acc = acc2, test_auc=auc2)
}
# metrics for naive prediction
results = data.frame(evaluate_metrics(rep(0, dim(df_train)[1]), 
                               rep(0, dim(df_test)[1]), 
                               df_train$target, df_test$target),
                     row.names=c("naive"))
results
```

### Logistic Regression (SGD)

To fulfil the project requirements, we implement a "from scratch" Stochastic Gradient Descent routine for Logistic Regression. The derivation follows [@hastie2009elements, pp. 120-126].

Consider a matrix of variables $X = (x_{1}, x_{2}, \ldots, x_{n})^{T}$, where $x_{i}$ denotes the $i$-th row or observation, and the target $\mathbf{y} = (y_{1}, y_{2}, \ldots, y_{n})$. Let $\frac{p(x_{i}; \beta)}{1 - p(x_{i} ; \beta)} = \exp(\beta^{T} x_{i})$ denote the odds.

We are interested in finding the coefficients $\beta$, such that the **logistic loss** (also referred to as binary cross-entropy, and equivalent to the negative log-likelihood) is minimised. The logistic loss is given by:

$$\begin{align}l(\boldsymbol{\beta}) &= -\sum_{i = 1}^{N} \left [ y_{i} \log(p(x_{i} ; \boldsymbol{\beta})) + (1 - y_{i}) \log(1 - p(x_{i} ; \boldsymbol{\beta})) \right ]\\ 
&= -\sum_{i = 1}^{N} \left [ y_{i}\log \left (\frac{p(x_{i} ; \boldsymbol{\beta})}{1 - p(x_{i} ; \boldsymbol{\beta})} \right) + \log(1 - p(x_{i} ; \boldsymbol{\beta})) \right ]\\
&= -\sum_{i = 1}^{N}\left [y_{i} \boldsymbol{\beta}^{T} x_{i} - \log(1 + \exp(\boldsymbol{\beta}^{T}x_{i})) \right ] \tag{1}
\end{align}$$ 
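As a quick check of the algebra, the cross-entropy form and the simplified form in (1) agree numerically; a sketch on simulated data, not the project data:

```{r loss-equivalence-check}
set.seed(2021)
X_chk <- matrix(rnorm(50 * 3), 50, 3)
b_chk <- rnorm(3)
y_chk <- rbinom(50, 1, 0.5)
p_chk <- 1 / (1 + exp(-X_chk %*% b_chk))  # p(x_i; beta)
loss_ce <- -sum(y_chk * log(p_chk) + (1 - y_chk) * log(1 - p_chk))
loss_simplified <- -sum(y_chk * (X_chk %*% b_chk) - log(1 + exp(X_chk %*% b_chk)))
all.equal(loss_ce, loss_simplified)  # TRUE
```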

With the inclusion of a regularisation term, specifically a $L^{2}$ penalty, the loss function becomes:

$$ \begin{equation}l(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [y_{i} \boldsymbol{\beta}^{T} x_{i} - \log(1 + \exp(\boldsymbol{\beta}^{T}x_{i})) \right ] + \lambda \boldsymbol{\beta}^{T} \boldsymbol{\beta} \tag{2} \end{equation}$$ 

The gradient of the loss function with respect to the coefficients $\boldsymbol{\beta}$ is given by:

$$\nabla l(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [ y_{i} x_{i} - \frac{x_{i}\exp(\boldsymbol{\beta}^{T} x_{i})}{1 + \exp(\boldsymbol{\beta}^{T} x_{i})} \right] + 2\lambda\boldsymbol{\beta}$$

We apply gradient descent (our implementation uses full-batch gradients). In short, this relies on updating the coefficients $\beta$ iteratively based on a step size $\gamma_{t}$, written $\gamma$ rather than $\lambda$ to avoid a clash with the regularisation parameter:

$$\beta_{t + 1} = \beta_{t} - \gamma_{t} \nabla l(\boldsymbol{\beta}_{t})$$

More sophisticated approaches such as Momentum or Adam can be used for the iterative updates; these are outlined in more detail in [@vishnoi2020optimisation]. In our simple implementation, we use the Barzilai-Borwein method [@murphy2012machine, pp. 444-445] to determine the step size:

$$\gamma_{t} = \frac{|(\beta_{t} - \beta_{t - 1})^{T}(\nabla l(\beta_{t}) -  \nabla l(\beta_{t - 1}))|}{\| \nabla l(\beta_{t}) -  \nabla l(\beta_{t - 1})\|^{2}}$$
Given sufficient iterations and appropriate step size, the model should converge (i.e. the change in the coefficients $\|\beta_{t + 1} - \beta_{t}\|$ or the change in loss is less than some threshold $\epsilon$), although this is not guaranteed.    

```{r gradient-descent}
# binary cross-entropy / log-loss with L2 penalty, averaged over observations
log_loss <- function(x, y, betas, lambda){
  logits <- x %*% betas
  (- (t(y) %*% logits - sum(log(1 + exp(logits)))) + lambda * t(betas) %*% betas) / dim(x)[1]
}
# gradient of the penalised log-loss (unscaled sum; the Barzilai-Borwein
# step size absorbs the scale)
gradients <- function(x, y, betas, lambda){
  logits <- x %*% betas
  - t(x) %*% (y - exp(logits)/(1 + exp(logits))) + lambda * 2 * betas
}
# pre set parameters
p = dim(X_train)[2]
lambda = 0
n_iters <- 100
init_step_size <- 1e-6
set.seed(2021)
beta_init <- matrix(rnorm(p),nrow=p)
beta_path <- matrix(rep(0, n_iters * p), nrow = n_iters, ncol=p)
beta_path[1,] = beta_init
last_grad <- grad <- gradients(X_train, y_train, beta_path[1,], lambda)
beta_path[2,] = beta_init - init_step_size * grad
grad <- gradients(X_train, y_train, beta_path[2,], lambda)
losses <- rep(0, n_iters)
# iteratively update betas
for (i in 3:n_iters){
    step_size <- as.numeric(t(beta_path[i - 1,] - beta_path[i - 2,]) %*% (grad - last_grad) / 
                    (t(grad - last_grad) %*% (grad - last_grad)))
    beta_path[i,] <- beta_path[i - 1,] - step_size * grad
    last_grad <- grad
    grad <- gradients(X_train, y_train, beta_path[i, ], lambda)
    losses[i] <- log_loss(X_train, y_train, beta_path[i,], lambda)
}
# plot + results
ggplot(data.frame(step = 3:n_iters, loss=losses[3:n_iters])) + 
  geom_line(aes(x = step, y = loss)) +
  ggtitle("Binary Crossentropy vs. Iterations")
pred_train <- as.numeric(1 / (1 + exp(-X_train %*% beta_path[n_iters,])))
pred_test <- as.numeric(1 / (1 + exp(-X_test %*% beta_path[n_iters,])))
results["sgd",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["sgd",]
```

We apply our "from scratch" implementation of Logistic Regression using SGD with all the numerical variables, obtaining a test accuracy of $0.751$ and a test AUC of $0.7082$: a small improvement in accuracy over the naive strategy, but a larger improvement in AUC.

**Remark**: Our implementation of logistic regression is quite limited: for example, it cannot process the *factor* datatype (unless we append a one-hot encoded matrix), nor examine coefficient standard errors (unless we switch to a Hessian-based approach / Newton's method). Given these limitations, we shall subsequently use the `glm` function for logistic regression, and the `glmnet` package for regularised logistic regression.


### Logistic Regression (Few Predictors)

Having explored the efficacy of logistic regression on this modelling task, we now utilise a more robust implementation of logistic regression via the `glm` package. We consider a more parsimonious logistic regression model, hand-picking several features based on our exploratory data analysis. We select the features that have discernible differences in their distributions conditioned on the target:  the predictors `cont3, cont4, cat13, cat18`.

By selecting few predictors, coupled with the in-built features of the `glm` implementation, we obtain, to some extent, *interpretability* and an understanding of the model. This will help guide our analysis and identify potential issues before expanding to more predictors or different models. We will also use this as a **baseline** against which to compare subsequent models.

```{r glm-baseline}
glm_base <- glm(target~cont3+cont4+cat13+cat18, data = df_train, family=binomial(link="logit"))
summary(glm_base)
vif(glm_base)
ggcoef(glm_base)
```


Firstly, the predictors are almost all *statistically significant* at the 5% level, with the exception of `cat18B` and the intercept. Inspecting the coefficient plots, which visualise the confidence intervals of each coefficient estimate, we observe that the confidence intervals of all variables except `cat18B` exclude 0. We also see that the confidence intervals for the coefficients of `cat18D` and `cat18C`, whose widths are proportional to the coefficients' standard errors, are much wider than those of `cat13B`, `cont4` and `cont3`; however, as the categorical and continuous variables are on completely different scales, these are not directly comparable. This may be a result of subsampling: as we increase the sample size, the standard errors of the coefficients should decrease.

Secondly, we inspect the *generalised variance inflation factor (GVIF)*. The *variance inflation factor* captures the extent to which the variance of a coefficient estimate is inflated by its correlation with other predictors in the model; the generalised version corrects for the degrees of freedom corresponding to the number of levels in the categorical variables. The GVIFs of all predictors in our model are less than 2, which suggests no evidence of multicollinearity. Nevertheless, a quick inspection of the variable `cat18` indicates that there are very few observations in the level `cat18A`; thus `cat18B` would be approximately equal to `1 - cat18C - cat18D` and potentially collinear once the intercept is included. This suggests that a more robust encoding of the categorical variables could be required, such as merging infrequent categories, or perhaps using regularisation instead.

```{r}
table(df_train$cat18)
```
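One simple remedy for such sparse levels, mentioned above, is to merge infrequent categories. A minimal helper for illustration only, not applied to the models in this report:

```{r collapse-rare-sketch}
# collapse any factor level with fewer than min_n observations into "Other"
collapse_rare <- function(f, min_n = 30) {
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  factor(ifelse(as.character(f) %in% rare, "Other", as.character(f)))
}
x_toy <- factor(c(rep("A", 2), rep("B", 50), rep("C", 40), rep("D", 45)))
table(collapse_rare(x_toy))  # the 2 "A" observations are merged into "Other"
```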
  
Assuming that there is no multicollinearity, the coefficient of each predictor represents the association of that predictor with the log-odds of having `target == 1`:  $\log \left ( \frac{P(\text{target}=1)}{P(\text{target}=0)} \right )$.

The coefficients of the dummy variables indicate the average difference between the log odds of that factor level group compared to the baseline level group. In our regression, the first category (A) of each categorical variable is taken as the baseline group. For example, the coefficient of `cat13B` implies the following equation:
$$\log\frac{P(target=1|cat13=B)}{P(target=0|cat13=B)}-\log\frac{P(target=1|cat13=A)}{P(target=0|cat13=A)}=1.8681$$


This suggests that claims with `cat13=B` have odds $\exp(1.8681) \approx 6.48$ times those of claims with `cat13=A`. Given that the coefficient of `cat18B` is not statistically significant (in other words, its standard error is large relative to the estimate), its association with the target may not differ significantly from `cat18=A` (inspecting the conditional plot from our exploratory data analysis, their distributions conditioned on the target appear similar). However, having `cat18=C` as opposed to `cat18=A` is associated with odds $\exp(2.6081) \approx 13.6$ times higher of having `target==1`. Similarly, claims with `cat18=D` have odds $\exp(2.4342) \approx 11.4$ times those of claims with `cat18=A`.

To interpret the coefficients of the continuous variables, note that the non-linearity of the logistic function means a one-unit change in a predictor does not correspond to a constant change in probability across the predictor's range; the effect on the *odds*, however, is constant. `cont3` and `cont4` have negative coefficients as expected, interpreted as follows: a one-unit increase in `cont3` is associated with an $\exp(-0.6554)=0.519$ multiplicative effect on the odds, i.e. a 48% decrease relative to the previous odds. Similarly, a one-unit increase in `cont4` is associated with a $1 - \exp(-0.4772)=37.9\%$ decrease in the odds of having `target==1`. As the true range of each continuous variable before normalisation is unknown, it is unclear whether these associated changes are large in reality.
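The conversions used above are simple arithmetic on the fitted log-odds coefficients:

```{r odds-ratio-arithmetic}
exp(1.8681)               # cat13B vs cat13A: odds multiplied by about 6.48
exp(-0.6554)              # one-unit increase in cont3: odds multiplied by 0.519
(1 - exp(-0.6554)) * 100  # equivalently, roughly a 48% reduction in the odds
(1 - exp(-0.4772)) * 100  # cont4: roughly a 37.9% reduction in the odds
```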

```{r message=FALSE, warning=FALSE}
pred_train <- predict(glm_base, df_train, type="response")
pred_test <- predict(glm_base, df_test, type="response")
results["glm-base",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-base",]
```

With just a few predictors, the baseline model has a test accuracy of $80.3\%$, compared with $74.5\%$ for the naive model. The test AUC of the baseline model is $0.703$, which is $0.203$ higher than that of the naive model, a fair improvement.

### Logistic Regression (All Predictors)

We now run a logistic regression using all 31 predictors. The regression coefficients can be interpreted in a similar way as the baseline model. The regression output shows that all predictors are significant at 5%, but this is unreliable since the categorical variables have many levels (623 in total). 

```{r glm-full-results, message=FALSE}
glm_full <- glm(target~., data=df_train, family=binomial(link="logit"), 
            control = list(maxit = 100))

pred_train <- predict(glm_full, df_train, type="response")
glm_full$xlevels = lapply(df[,cat_feats], levels)
pred_test <- predict(glm_full, df_test, type="response")
results["glm-full",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-full",]
```

The full model has a higher train accuracy (0.801) but lower test accuracy (0.59) than the baseline model, which implies overfitting. The test AUC is 0.543, which is much lower than the baseline. This is an example of the bias-variance trade-off; the more complex, full model with 31 predictors has a lower bias but higher variance than the simpler, baseline model with 4 predictors. 

We also obtain a warning, "prediction from a rank-deficient fit may be misleading", which suggests multicollinearity. We aim to overcome this and improve the stability of our model using shrinkage in a [subsequent section](#regularised-logistic-regression).

### Outlier Detection
At this stage, we suspect that some outliers might be heavily influencing the performance of our models, and hence we seek to detect them using a few measures. Formally, outliers are observations whose response is unusual conditional on the covariates (predictors). Firstly, we look for points with large (studentised) residuals. We can then test whether these residuals are significantly larger than those of other points by examining the Bonferroni-adjusted p-values. Ten points are identified as having large studentised residuals with adjusted p-values less than 0.05. We store the indices of these points to be removed later.

```{r}
outlierTest(glm_full)
outliers <- as.numeric(names(outlierTest(glm_full)$p))
```

Observations that are far from the average covariate pattern are considered to have high leverage, which can be measured using hat values; here, there are at least 100 points with high leverage. Finally, we also measure *influence*: an influential observation is one that is both an outlier and has high leverage, and is therefore likely to alter the regression coefficients; influence can be thought of as the product of leverage and outlyingness. Below, we plot studentised residuals against hat values, with the area of each circle proportional to the observation's Cook's distance, a measure of influence.


```{r influence}
influenceIndexPlot(glm_full, vars = "hat")
influencePlot(glm_full)
```

We observe that there are 4 observations with high influence. We remove these observations along with the points identified as outliers earlier (14 in total), and compare the performance of our updated model with the original model (see [Appendix](#appendix)). 

We find that removing outliers (see [Appendix](#appendix)) leads to a higher test accuracy of 0.684 but a lower test AUC of 0.539. We cannot interpret this further without knowing the specific threshold used to classify the target in the data. Analysing the regression output, we see that leaving out the outliers does not change the coefficient estimates, and since we have no intuitive reason to believe that these outlier values are extreme (due to no knowledge about the variables themselves), we decided to keep these outlier data points for all other models. Furthermore, it is possible that these outliers may be less impactful as we increase the sample size.


### Regularised Logistic Regression

Given the issue of *high dimensionality*, we consider a regularised form of logistic regression. Specifically, we consider ridge (logistic) regression, with the functional form as in [equation (2)](#logistic-regression-\(sgd\)). This method shrinks coefficients by imposing a penalty on their size, thereby reducing model complexity. We use the implementation offered by the `glmnet` package, which requires the data to be in a *matrix* data type, and hence we make the corresponding adjustment.

```{r glm-ridge-scores, message=FALSE, warning=FALSE}
X_train = df_train[, -length(df_train)]
y_train <- df_train$target
X_test = df_test[, -length(df_test)]
y_test <- df_test$target
X_train = model.matrix(~., X_train)
X_test = model.matrix(~., X_test)
# alpha = 0 corresponds to the ridge (L2) penalty
glm_reg <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 0)
pred_train <- as.numeric(predict(glm_reg, X_train, type="response"))
pred_test <- as.numeric(predict(glm_reg, X_test, type="response"))
results["glm-ridge",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-ridge",]
```

### Random Forest

To improve performance, we draw on non-linear tree-based models, specifically random forests [@breiman2016randomforests]. Intuitively, a random forest averages many decision trees, each fit on a bootstrap resample of the data (bagging) using a random subset of candidate features at each split, so as to reduce the variance of the individual trees. This reduces the likelihood of individual trees overfitting to the training data.
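The variance-reduction argument can be made precise: for $B$ identically distributed trees, each with variance $\sigma^{2}$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^{2} + \frac{1-\rho}{B}\sigma^{2}$ [@hastie2009elements, eq. 15.1]. The second term vanishes as $B$ grows, while the first is shrunk only by decorrelating the trees, which is what the random feature subsets achieve. A numerical illustration with assumed values of $\rho$ and $\sigma^{2}$:

```{r bagging-variance-sketch}
# Var of the average of B correlated, identically distributed predictors
ensemble_var <- function(rho, sigma2, B) rho * sigma2 + (1 - rho) * sigma2 / B
ensemble_var(rho = 0.4, sigma2 = 1, B = 1)    # a single tree: variance 1
ensemble_var(rho = 0.4, sigma2 = 1, B = 100)  # 0.406: the rho * sigma^2 floor remains
ensemble_var(rho = 0.1, sigma2 = 1, B = 100)  # decorrelating lowers the floor further
```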

```{r rf ,message=FALSE, warning = FALSE}
rf_recipe <- recipe(target~., data = df_train)

rf_model <- 
  rand_forest(trees=100) %>%
  set_engine("ranger", importance="impurity", seed=2021) %>%
  set_mode("classification")

rf_workflow <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_model)

rf1 <- fit(rf_workflow, df_train)
ranger_obj <- pull_workflow_fit(rf1)$fit
# plot feature importances
rf_feat_imp <- rownames_to_column(data.frame(ranger_obj$variable.importance));
names(rf_feat_imp) <- c("feat", "importance")
ggplot(rf_feat_imp, aes(x = reorder(feat, importance), y = importance))+ 
      geom_bar(stat="identity", position="dodge")+ coord_flip()+
      ylab("Feature Importance (Gini Impurity)")+
      xlab("Feature")+
      ggtitle("Random Forest Feature Importances")
#evaluate metrics
pred_train <- unlist(predict(rf1,df_train,type='prob')[,2])
pred_test <- unlist(predict(rf1, df_test, type='prob')[,2])
results["rf",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["rf",]
```



```{r pdp, message=FALSE, warning=FALSE}
# partial dependence plots for the continuous features
pdp_plot <- function(v) {
  partial(ranger_obj, train = df_train, pred.var = v, plot = TRUE,
          plot.engine = "ggplot2", paropts = list(.packages = "ranger"))
}
plots <- lapply(names(df_train)[cont_feats], pdp_plot)
wrap_plots(plots)
```

The random forest has a test accuracy of 0.836 and a test AUC of 0.872, higher than the previous linear models. However, the train accuracy and AUC are significantly higher than the test performance, which suggests some degree of overfitting to the training data. 

Random forests can be used to rank the importance of different features. Specifically, the axis reports the impurity-based importance used when fitting the model: the larger the total decrease in Gini impurity attributable to splits on a variable, the more important that variable is. Note here that for categorical variables, each level of the category is treated as a separate variable. In this plot, we recorded the 30 most important variables. 

We then run another random forest with the most important features. In doing so, we hope to reduce the degree of overfitting by reducing the complexity of the model. However, it does not make sense to drop some levels of a categorical variable while including the other levels. Hence, as long as a level is present in the top 30 features, we will include the entire category in our updated random forest model. This results in us keeping only 22 variables, from an initial 30. 
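The "keep the whole variable if any of its levels ranks highly" rule can be sketched as follows, using made-up importance names in the one-hot style (`cat18B`, etc.); the regex assumes the `cat` + digits + level naming pattern seen in our plots:

```{r level-to-variable-sketch}
top_feats <- c("cont3", "cat18B", "cat13C", "cont4", "cat13A")
# strip the trailing level letter(s) from categorical dummies
base_vars <- unique(sub("^(cat[0-9]+).*$", "\\1", top_feats))
base_vars  # "cont3" "cat18" "cat13" "cont4"
```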

Our reduced random forest has a slightly improved test accuracy and a slightly decreased test AUC. However, it does not solve the potential problem of overfitting, as train performance is still significantly better than test performance. In fact, train performance of the reduced random forest is better than that of the random forest with the full set of variables; with the less informative features removed, the trees fit the retained predictors more closely on the training set. 


### XGBoost

For completeness, we also consider the `xgboost` library for **gradient boosted decision trees**. The XGBoost package, introduced in [@chen2016xgboost], is a variant of Gradient Boosted Decision Trees [@hastie2009elements, pp. 353-374]. In gradient boosting, each decision tree is trained sequentially on the *residuals* (more generally, the negative gradients of the loss) of the ensemble built so far, which has the effect of reducing bias.
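The residual-fitting idea can be illustrated in one dimension with regression stumps and squared-error loss; a sketch for intuition only, as XGBoost fits trees to gradients of the log-loss with additional regularisation:

```{r boosting-residuals-sketch}
set.seed(2021)
x_b <- runif(200)
y_b <- sin(2 * pi * x_b) + rnorm(200, sd = 0.3)
# a regression stump: one split chosen to minimise squared error
fit_stump <- function(x, r) {
  splits <- quantile(x, seq(0.1, 0.9, by = 0.1))
  sse <- sapply(splits, function(s) {
    pred <- ifelse(x < s, mean(r[x < s]), mean(r[x >= s]))
    sum((r - pred)^2)
  })
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}
predict_stump <- function(st, x) ifelse(x < st$split, st$left, st$right)
# boosting: repeatedly fit a stump to the current residuals
pred <- rep(0, length(x_b)); eta <- 0.5
losses <- numeric(20)
for (m in 1:20) {
  st <- fit_stump(x_b, y_b - pred)        # fit the residuals, not y itself
  pred <- pred + eta * predict_stump(st, x_b)
  losses[m] <- mean((y_b - pred)^2)
}
round(losses[c(1, 20)], 3)  # training MSE shrinks as rounds accumulate
```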

```{r xgb, message=FALSE, warning=FALSE}
# fit the one-hot encoder on the training data and apply it to both sets,
# so that train and test share the same encoded columns
dmy_train <- dummyVars("~.", data = df_train[,-length(df_train)])
X_train <- as.matrix(data.frame(predict(dmy_train, df_train)))
X_test <- as.matrix(data.frame(predict(dmy_train, df_test)))
y_train = as.integer(as.matrix(df_train$target))
y_test = as.integer(as.matrix(df_test$target))
bst <- xgboost(data = X_train, label = y_train, max_depth = 2, nrounds = 10, 
               verbose = 0,
               objective = 'binary:logistic',
               eval_metric = "logloss")

pred_train <- predict(bst, X_train, type="response")
pred_test <- predict(bst, X_test, type="response")
results["xgb",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["xgb",]

importance_matrix <- xgb.importance(model=bst)
xgb.plot.importance(importance_matrix)
# shapley values  
xgboost::xgb.ggplot.shap.summary(X_test, model = bst, 
                                 target_class = 1, top_n = 20)  # Summary plot
```

### CatBoost

We also briefly examine the `catboost` library for **gradient boosted decision trees**, given its popularity in machine learning competitions such as those on Kaggle. The CatBoost library has the advantage of learning a *target encoding*, i.e. a numerical value for each category of every categorical variable; from an implementation point of view, this may reduce the preprocessing required. For details on CatBoost, refer to [@prokhorenkova2017catboost]. 
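A naive (unregularised) mean target encoding can be sketched in one line on toy data; CatBoost's *ordered target statistics* are a more careful variant that avoids leaking the target of the row being encoded:

```{r target-encoding-sketch}
# replace each category by the mean of the target within that category
grp <- c("A", "A", "B", "B", "B", "C")
y_toy <- c(1, 0, 1, 1, 0, 0)
encoded <- ave(y_toy, grp, FUN = mean)
encoded  # A -> 0.5, B -> 2/3, C -> 0
```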

Again, the key parameters are the number of trees (iterations) and the *depth* of each tree. Due to computational limitations, we select arbitrary parameters `iterations = 10` and `depth = 8`; these could be further fine-tuned using hyperparameter optimisation.

```{r cb, message=FALSE, warning=FALSE}
X_train = df_train[,c(cat_feats, cont_feats)]
y_train = as.integer(df_train$target)
X_test = df_test[,c(cat_feats, cont_feats)]
y_test = as.integer(df_test$target)

pool <- catboost.load_pool(X_train, y_train, cat_features = cat_feats)
model <- catboost.train(pool, params=list(depth = 8, iterations = 10, 
                                          loss_function='Logloss', verbose=0))

pred_train <- catboost.predict(model, catboost.load_pool(X_train), prediction_type = 'Probability')
pred_test <- catboost.predict(model, catboost.load_pool(X_test), prediction_type = 'Probability')

# feature importance
feat_importance <- catboost.get_feature_importance(model, pool)
importances <- data.frame(feat_importance[order(feat_importance, decreasing=FALSE),])
importances$features = rownames(importances)
names(importances) <- c("importance","features")
importances$features <- factor(importances$features, level=importances$features)
ggplot(importances, aes(x=importance, y=features)) + 
  geom_bar(stat="identity") +
  ggtitle("CatBoost Feature Importance")

#Shapley Values
data_shap_tree <- catboost.get_feature_importance(model, pool = pool,
                                                  type = "ShapValues")
data_shap_tree <- data.frame(data_shap_tree[, -ncol(data_shap_tree)]) 
names(data_shap_tree) = names(df[, c(cat_feats, cont_feats)])

# ggplot(stack(data_shap_tree), aes(x = ind, y = values)) +
#     geom_point(aes(color = values)) + coord_flip() + 
#     ggtitle("Shapley Values by variable")  + scale_color_viridis_c()

results["catboost",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["catboost",]
```

We obtain a train accuracy of $0.882$ and a train AUC of $0.921$, versus a test accuracy of $0.836$ and a test AUC of $0.858$. This suggests that the model may be overfitting for our particular choice of parameters (and for our subsampled dataset size).

For interpretability, we can use CatBoost's *feature importance*. From the [CatBoost documentation](https://catboost.ai/docs/concepts/fstr.html), CatBoost's in-built default feature importance does the following:

> For each feature, PredictionValuesChange shows how much on average the prediction changes if the feature value changes. The bigger the value of the importance the bigger on average is the change to the prediction value, if this feature is changed.

We could also set the type of feature importance to Shapley Values, although we omit the plot in this case due to the lack of in-built support for plotting.

## Results and Evaluation

```{r results, warning=FALSE, message=FALSE}
results[order(results$test_auc, results$test_acc),]
```

Although the tree-based methods are less interpretable and more "black box" to some extent, we can make use of model interpretability techniques such as the feature importances and Shapley values shown above. We also note that our results may not be fully robust, given we have subsampled the original data.

## Conclusion

In this project, we demonstrated the use of logistic regression, gradient descent, regularisation, random forests, gradient boosting and hyperparameter tuning to achieve interpretability and model accuracy (in separate cases) in predicting the occurrence of an insurance claim. Model ___ achieved the highest AUC, although interpretability was sacrificed. On the other hand, our baseline model with only a few predictors is easy to interpret, but it has a relatively low AUC of 0.703. In practice, depending on whether the goals of the company lean towards prediction or explanation, an appropriate trade-off between model complexity and interpretability has to be chosen. 

From the insurer’s profit perspective, understanding which characteristics are highly correlated with large amounts of claims could also be used to inform whether or not a particular insurance application is accepted, even if causality cannot be inferred. However, this might lead to some legal and/or ethical issues. For example, the UK’s 2010 Equality Act makes discriminating on nine protected characteristics, including age, gender, race and religion, illegal (Pattni, 2020). Therefore, proper data governance needs to be maintained and privacy impact assessments need to be conducted before any machine learning models of this kind are deployed in industry (GDPR, 2018).



## Bibliography

::: {#refs}
:::

## Appendix

### Comparison of Coefficients with Outliers Removed

```{r removed-outliers}
influencers <- as.numeric(rownames(influencePlot(glm_full)))
glm_full_influencers <- update(glm_full, subset = c(-influencers))
glm_full_outliers <- update(glm_full, subset = c(-outliers))
removal_list <- union(outliers, influencers)
glm_full_removed <- update(glm_full, subset = c(-removal_list))
compareCoefs(glm_full, glm_full_influencers, glm_full_outliers, glm_full_removed)
```

```{r}
set.seed(2021)
cv_splits <- rsample::vfold_cv(df_train, strata = target, v= 3)
mod <- logistic_reg(penalty = tune(),
                    mixture = tune()) %>%
  set_engine("glmnet")

glmnet_recipe <- recipe(target~.,data = df_train) %>% 
  step_dummy(all_nominal(), -all_outcomes())

glmnet_workflow<- workflow() %>%
  add_recipe(glmnet_recipe) %>%
  add_model(mod)

glmn_set <- parameters(penalty(range = c(-5,1), trans = log10_trans()),
                       mixture())

glmn_grid <- 
  grid_regular(glmn_set, levels = c(7, 5))
ctrl <- control_grid(save_pred = TRUE, verbose = TRUE)

glmn_tune <- 
  tune_grid(glmnet_workflow,
            resamples = cv_splits,
            grid = glmn_grid,
            metrics = metric_set(roc_auc),
            control = ctrl)


best_glmn <- select_best(glmn_tune, metric = "roc_auc")

# finalise the workflow with the best penalty/mixture before fitting;
# fitting with unresolved tune() placeholders would fail
glmnet_workflow <- finalize_workflow(glmnet_workflow, best_glmn)
glm_reg <- fit(glmnet_workflow, data = df_train)
```


```{r}
xgb_spec <- boost_tree(
  trees = 100, 
  tree_depth = tune(),
  learn_rate = tune()                         ## step size
) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  learn_rate(),
  size = 5
)

xgb_wf <- workflow() %>%
  add_formula(target ~ .) %>%
  add_model(xgb_spec)

set.seed(123)
vb_folds <- vfold_cv(df_train, strata = target)

doParallel::registerDoParallel()

set.seed(234)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = vb_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
best_auc <- select_best(xgb_res, metric = "roc_auc")
final_xgb <- finalize_workflow(
  xgb_wf,
  best_auc
)
library(vip)

final_xgb %>%
  fit(data = df_train) %>%
  pull_workflow_fit() %>%
  vip(geom = "point")

final_res <- last_fit(final_xgb, df_split)

collect_metrics(final_res)

xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  # only tree_depth and learn_rate were tuned in xgb_spec
  select(mean, tree_depth, learn_rate) %>%
  pivot_longer(c(tree_depth, learn_rate),
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")
```

```{r}
# https://curso-r.github.io/treesnip/index.html
library(treesnip)
```
